2024/01/10
Statistics vs Data Science
Hypothesis Testing vs Hypothesis Creating
Problem Solving vs Understanding Issues (to Attack Problems)
Start RStudio
Create or open a project
R Notebook, R Markdown, Quarto (or R Script, Console)
Reproducibility, Literate Programming, and Communication
R Notebook: HTML file with R Markdown source file
Data themes [Link]
People [Link]
Life-expectancy
R Notebook [Link to the source], [R Notebook]
World Development Indicators: life-expectancy
Week 3, Jan. 10: EDA2, WDI, data transformation, data visualization
Week 4, Jan. 17: EDA3, WDI, tidy data (long and wide data), choropleth maps
Week 5, Jan. 24: EDA4, UN, OECD, readr, readxl, two-table verbs
Week 6, Jan. 31: EDA5, Round-up, communication, pdf, Word, PowerPoint
CRAN Package Site: WDI [https://CRAN.R-project.org/package=WDI]
WDI: World Development Indicators and Other World Bank Data
The WDI function provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national poverty indicators. For fast searching, the WDI package ships with a local list of available data series. This local list can be updated to the latest version using the WDIcache function.
Search: CRAN package=WDI.
See URL, News and Reference Manual in the page.
For other packages, use the same search pattern: ‘CRAN package=Package Name’
library(tidyverse)
library(WDI)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df_lifeexp <- WDI(indicator = c(pop = "SP.POP.TOTL",
fertility = "SP.DYN.TFRT.IN",
lifeexp = "SP.DYN.LE00.IN"),
extra = TRUE)
Omitted arguments take their default values (see the help page: ?WDI). With the defaults written out, the same call is:
df_lifeexp <- WDI(country = "all",
indicator = c(pop = "SP.POP.TOTL",
fertility = "SP.DYN.TFRT.IN",
lifeexp = "SP.DYN.LE00.IN"),
start = 1960, end = NULL,
extra = TRUE, cache = NULL, latest = NULL,
language = "en")
dplyr, a package in tidyverse
select() : select columns, e.g. select(c(1,2,4,5,7,8))
filter() : select rows meeting a condition
filter(country == "World")
filter(iso2c %in% c("JP", "IN", "CN"))
filter(region != "Aggregates")
distinct() : select rows with distinct values of a variable, e.g. distinct(country)
drop_na() : drop rows with NA values, e.g. drop_na(pop)
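The verbs listed above can be tried without downloading anything. The following is a minimal sketch on a small hand-made table; `df_toy` and all of its values are hypothetical and only shaped like a WDI download (they are not real World Bank figures).

```r
library(dplyr)
library(tidyr)   # drop_na() comes from tidyr

# Hypothetical toy data shaped like a WDI download (values are made up)
df_toy <- tibble(
  country = c("World", "Japan", "Japan", "India", "China"),
  iso2c   = c("1W", "JP", "JP", "IN", "CN"),
  region  = c("Aggregates", "East Asia & Pacific", "East Asia & Pacific",
              "South Asia", "East Asia & Pacific"),
  year    = c(2020, 2019, 2020, 2020, 2020),
  pop     = c(100, 10, NA, 30, 40)
)

# select(): keep columns by position (names work as well)
df_cols <- df_toy |> select(c(1, 2, 4))

# filter(): keep rows meeting a condition
df_asia      <- df_toy |> filter(iso2c %in% c("JP", "IN", "CN"))
df_countries <- df_toy |> filter(region != "Aggregates")

# distinct(): one row per distinct country name
df_names <- df_toy |> distinct(country)

# drop_na(): drop rows where pop is missing
df_complete <- df_toy |> drop_na(pop)
```

Selecting by name, e.g. select(country, iso2c, year), is usually easier to read than positions, because positions change when the column order changes.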
ggplot2
ggplot(aes(x=var1, y=var2)) +
Scatter Plot: geom_point(aes(col=var3, size=var4, shape=var5))
Line Graph: geom_line(aes(col=var3, linetype=var4))
Title, Subtitle, Legend and Axis labels
labs(title = "", subtitle = "", x = "", y = "", col = "")
df_life |>
  filter(country == "World") |>
  drop_na(life_expectancy) |>
  ggplot(aes(year, life_expectancy)) +
  geom_line() +
  labs(title = "Life expectancy of the World")
df_life |>
  filter(year == 2021) |>
  filter(region != "Aggregates") |>
  drop_na(fertility_rate, life_expectancy) |>
  ggplot(aes(fertility_rate, life_expectancy, col = region)) +
  geom_point()
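The ggplot2 building blocks above (aesthetic mapping, geoms, labs) can be combined as in the following sketch. It uses a small made-up data frame, so `df_toy`, its column names, and its values are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical toy data (made-up values, for illustration only)
df_toy <- data.frame(
  year  = rep(2000:2004, times = 2),
  value = c(1, 2, 3, 4, 5, 2, 2, 3, 3, 4),
  group = rep(c("A", "B"), each = 5)
)

# Map x and y once in ggplot(); add per-layer aesthetics inside each geom
p <- ggplot(df_toy, aes(x = year, y = value)) +
  geom_line(aes(col = group, linetype = group)) +
  geom_point(aes(col = group)) +
  labs(title = "Toy line graph", subtitle = "Made-up values",
       x = "Year", y = "Value", col = "Group", linetype = "Group")

p  # printing the object draws the plot
```

Giving col and linetype the same legend title in labs() lets ggplot2 merge them into a single legend.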
See Help.
wdicache <- WDIcache()
str(wdicache)
## List of 2
##  $ series :'data.frame': 24460 obs. of 5 variables:
##   ..$ indicator         : chr [1:24460] "1.0.HCount.1.90usd" "1.0.HCount.2.5usd" "1.0.HCount.Mid10to50" "1.0.HCount.Ofcl" ...
##   ..$ name              : chr [1:24460] "Poverty Headcount ($1.90 a day)" "Poverty Headcount ($2.50 a day)" "Middle Class ($10-50 a day) Headcount" "Official Moderate Poverty Rate-National" ...
##   ..$ description       : chr [1:24460] "The poverty headcount index measures the proportion of the population with daily per capita income (in 2011 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income (in 2005 PPP"| __truncated__ "The poverty headcount index measures the proportion of the population with daily per capita income below the of"| __truncated__ ...
##   ..$ sourceDatabase    : chr [1:24460] "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" "LAC Equity Lab" ...
##   ..$ sourceOrganization: chr [1:24460] "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of SEDLAC (CEDLAS and the World Bank)." "LAC Equity Lab tabulations of data from National Statistical Offices." ...
##  $ country:'data.frame': 297 obs. of 9 variables:
##   ..$ iso3c    : chr [1:297] "ABW" "AFE" "AFG" "AFR" ...
##   ..$ iso2c    : chr [1:297] "AW" "ZH" "AF" "A9" ...
##   ..$ country  : chr [1:297] "Aruba" "Africa Eastern and Southern" "Afghanistan" "Africa" ...
##   ..$ region   : chr [1:297] "Latin America & Caribbean" "Aggregates" "South Asia" "Aggregates" ...
##   ..$ capital  : chr [1:297] "Oranjestad" "" "Kabul" "" ...
##   ..$ longitude: chr [1:297] "-70.0167" "" "69.1761" "" ...
##   ..$ latitude : chr [1:297] "12.5167" "" "34.5228" "" ...
##   ..$ income   : chr [1:297] "High income" "Aggregates" "Low income" "Aggregates" ...
##   ..$ lending  : chr [1:297] "Not classified" "Aggregates" "IDA" "Aggregates" ...
str(wdicache, max.level=1)
## List of 2
##  $ series :'data.frame': 24460 obs. of 5 variables:
##  $ country:'data.frame': 297 obs. of 9 variables:
wdicache$series
wdicache$country
write_rds(wdicache, "data/wdicache.rds")
wdicache <- read_rds("data/wdicache.rds")
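Saving the cache with write_rds() and reading it back with read_rds() avoids repeating a slow download. The round trip can be checked with the sketch below; `obj` is a hypothetical stand-in for wdicache, and a temporary file is used so the example does not depend on an existing data/ folder.

```r
library(readr)

# Hypothetical stand-in object; in the notes this would be wdicache
obj <- list(series  = data.frame(indicator = "SP.POP.TOTL"),
            country = data.frame(iso2c = "JP"))

# Write to a temporary file (assumption: any writable path would do)
path <- tempfile(fileext = ".rds")
write_rds(obj, path)

# Read it back and compare with the original object
obj2 <- read_rds(path)
all.equal(obj, obj2)
```

When writing to "data/wdicache.rds", the data folder must already exist in the project; dir.create("data", showWarnings = FALSE) creates it if needed.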
WDIsearch(string = "education", field = "name", short = TRUE, cache = wdicache)
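WDIsearch() can also be tried without a freshly downloaded cache. The sketch below assumes the WDI package is installed and relies on the local list of series that ships with it (cache = NULL); the search string is just an example, and the string is treated as a pattern matched against the chosen field.

```r
library(WDI)

# Search indicator names in the list bundled with the package (no download)
res <- WDIsearch(string = "life expectancy at birth", field = "name",
                 short = TRUE, cache = NULL)

# Each row pairs an indicator code with its name
head(res)
```

The codes in the indicator column are the ones passed to WDI(indicator = ...). The bundled list can be older than the World Bank's current catalogue, which is why the notes pass an updated cache from WDIcache().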
wdibulk <- WDIbulk(timeout=600)
List of 6
 $ Data          : tibble [24,902,388 × 6] (S3: tbl_df/tbl/data.frame)
 $ Country       :'data.frame': 265 obs. of 31 variables:
 $ Series        :'data.frame': 1486 obs. of 21 variables:
 $ Country-Series:'data.frame': 8241 obs. of 4 variables:
 $ Series-Time   :'data.frame': 148 obs. of 4 variables:
 $ FootNote      :'data.frame': 741689 obs. of 5 variables:
df_iris <- datasets::iris
df_iris |> group_by(Species) |> summarize(mean_sl = mean(Sepal.Length), mean_sw = mean(Sepal.Width))
df_iris |> mutate(Sepal.Ratio = Sepal.Length/Sepal.Width)
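The two iris examples above can be chained in one pipeline: mutate() adds a column row by row, then group_by() with summarize() collapses each group to a single row. This is a minimal sketch; the names `df_summary`, `n`, and `mean_ratio`, and the final arrange() step, are additions for illustration.

```r
library(dplyr)

df_iris <- datasets::iris

df_summary <- df_iris |>
  # mutate(): new column computed for every row
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width) |>
  # group_by() + summarize(): one row per species
  group_by(Species) |>
  summarize(n = n(),
            mean_sl = mean(Sepal.Length),
            mean_ratio = mean(Sepal.Ratio)) |>
  # arrange(): sort species by mean sepal length, largest first
  arrange(desc(mean_sl))

df_summary
```

The same pattern applies to WDI data, for example grouping by region and summarizing an indicator for one year.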
CO2 per capita
WDI template
Choose two WDI codes and analyse the data.
Create an R Notebook of a data analysis containing the following, and submit the rendered file (e.g. w3_g123456.nb.html):
create an R Notebook using the R Notebook Template in Moodle, save as w3_g123456.Rmd,
edit author with name, ID, and title
contents should include the following:
A short abstract
Description of data including data name, data code (or id), description
A bar graph or a column graph, a histogram, a line graph, a scatter plot
add observations or questions for visualizations
run each code block, preview to create w3_g123456.nb.html,
submit w3_g123456.nb.html to Moodle.
Submit your R Notebook file (w3_g123456.nb.html) in Moodle (Week Three Assignment).
Due: Sunday, 14 January 2024, 11:59 PM